feat: extract & insert sidecar batches in `replay`'s action iterator #679

sebastiantia · 2025-02-06T00:33:37Z

What changes are proposed in this pull request?

Summary

This PR introduces foundational changes required for V2 checkpoint read support. The high-level changes required for V2 checkpoint support are:
Item 1. Allow log segments to be built with V2 checkpoint files.
Item 2. Allow log segment replay to retrieve actions from sidecar files when needed.

This PR specifically adds support for Item 2.

This PR does not introduce full v2Checkpoint reader/writer support because Item 1 is still missing, meaning log segments can never contain V2 checkpoint files in the first place. That functionality will be completed in PR #685, which is stacked on top of this PR. However, the log replay changes made here are compatible with tables using V1 checkpoints, so they can be merged safely.

Changes

For each batch of EngineData from a checkpoint file:

- Use the new SidecarVisitor to scan the batch for sidecar file paths embedded in sidecar actions.
- If sidecar file paths exist:
  - Read the corresponding sidecar files.
  - Generate an iterator over the batches of actions within the sidecar files.
  - Insert the sidecar batches, which contain the add actions necessary to reconstruct the table’s state, into the top-level iterator.
  - Note: the original checkpoint batch is still included in the iterator.
- If no sidecar file paths exist, move to the next batch and leave the original checkpoint batch in the iterator.
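
The flow above can be sketched with a small std-only model. This is an illustration under stated assumptions, not the kernel's implementation: `Batch`, `read_sidecar_batches`, and `checkpoint_stream` are hypothetical stand-ins for EngineData, the Parquet read, and the real stream construction.

```rust
/// Hypothetical stand-in for a batch of EngineData.
#[derive(Debug, Clone, PartialEq)]
struct Batch {
    name: String,
    /// Sidecar file paths referenced by sidecar actions in this batch.
    sidecar_paths: Vec<String>,
}

/// Hypothetical reader: pretends each sidecar file yields exactly one batch.
fn read_sidecar_batches(paths: Vec<String>) -> impl Iterator<Item = Batch> {
    paths.into_iter().map(|p| Batch {
        name: format!("sidecar:{}", p),
        sidecar_paths: Vec::new(),
    })
}

/// Keep every checkpoint batch and insert the batches of any sidecar files
/// it references directly after it.
fn checkpoint_stream(checkpoint_batches: Vec<Batch>) -> impl Iterator<Item = Batch> {
    checkpoint_batches.into_iter().flat_map(|batch| {
        let paths = batch.sidecar_paths.clone();
        // A batch without sidecar paths simply contributes no extra batches.
        std::iter::once(batch).chain(read_sidecar_batches(paths))
    })
}
```

Because the closure always returns the same `once(..).chain(..)` shape, batches with and without sidecars flow through one iterator type.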

Notes:

If the checkpoint_read_schema does not have file actions, we do not need to scan the batch with the SidecarVisitor and can leave the batch as-is in the top-level iterator.
Multi-part checkpoints do not have sidecar actions, so we do not need to scan the batch with the SidecarVisitor and can leave the batch as-is in the top-level iterator.
A batch may contain no add actions, only other actions (such as txn, metadata, protocol). Such a batch is safe to leave in the iterator, as the non-file actions will be ignored.
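
The first two notes amount to a single skip condition. A minimal sketch, assuming the read schema can be summarized by its top-level column names and the checkpoint by its number of parts (`CheckpointRead` and `needs_sidecar_scan` are hypothetical names, not kernel APIs):

```rust
/// Hypothetical summary of what a checkpoint read needs to know.
struct CheckpointRead<'a> {
    /// Top-level column names in the checkpoint read schema.
    schema_columns: &'a [&'a str],
    /// Number of checkpoint files; more than one means a multi-part checkpoint.
    checkpoint_part_count: usize,
}

/// Returns true only when scanning batches with a sidecar visitor can find anything.
fn needs_sidecar_scan(read: &CheckpointRead) -> bool {
    let has_file_actions = read
        .schema_columns
        .iter()
        .any(|c| *c == "add" || *c == "remove");
    let is_multi_part = read.checkpoint_part_count > 1;
    // No file actions requested, or a multi-part checkpoint: leave batches as-is.
    has_file_actions && !is_multi_part
}
```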

resolves #670

How was this change tested?

Although log segments cannot yet have V2 checkpoints, we can easily mock batches that include sidecar actions like those found in V2 checkpoints.

test_sidecar_to_filemeta_valid_paths
- Tests handling of sidecar paths, which can be any of the following:
  - A relative path within the _delta_log/_sidecars directory that is just a file name.
  - A relative path that has a parent (i.e. a directory component).
  - An absolute path.
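
The three path shapes can be illustrated with a simplified, string-based sketch. The review thread further down joins URLs instead (`log_root.join("_sidecars/")?.join(&sidecar.path)?`); the function below, `resolve_sidecar_location`, is a hypothetical stand-in that assumes `log_root` ends with `/` and treats any path containing `://` as absolute.

```rust
/// Resolve a sidecar path against the table's log root (simplified string model).
/// Assumes `log_root` ends with '/', e.g. "file:///table/_delta_log/".
fn resolve_sidecar_location(log_root: &str, sidecar_path: &str) -> String {
    if sidecar_path.contains("://") {
        // Absolute path: use it as-is.
        sidecar_path.to_string()
    } else {
        // Bare file name or relative path with a parent: resolve under _sidecars/.
        format!("{}_sidecars/{}", log_root, sidecar_path)
    }
}
```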

Unit tests for process_single_checkpoint_batch:

test_checkpoint_batch_with_no_sidecars_returns_none
- Verifies that if no sidecar actions are present, the checkpoint batch is returned unchanged.
test_checkpoint_batch_with_sidecars_returns_sidecar_batches
- Ensures that when sidecars are present, the corresponding sidecar files are read, and their batches are returned.
test_checkpoint_batch_with_sidecar_files_that_do_not_exist
- Tests behavior when sidecar files referenced in the checkpoint batch do not exist, ensuring an error is returned.

Unit tests for create_checkpoint_stream:

test_create_checkpoint_stream_errors_when_schema_has_remove_but_no_sidecar_action
- Validates that if the schema includes the remove action, it must also contain the sidecar column.
test_create_checkpoint_stream_errors_when_schema_has_add_but_no_sidecar_action
- Validates that if the schema includes the add action, it must also contain the sidecar column.
test_create_checkpoint_stream_returns_checkpoint_batches_as_is_if_schema_has_no_file_actions
- Checks that if the schema has no file actions, the checkpoint batches are returned unchanged
test_create_checkpoint_stream_returns_checkpoint_batches_if_checkpoint_is_multi_part
- Ensures that for multi-part checkpoints, the batch is not visited, and checkpoint batches are returned as-is.
test_create_checkpoint_stream_reads_parquet_checkpoint_batch_without_sidecars
- Tests reading a Parquet checkpoint batch and verifying it matches the expected result.
test_create_checkpoint_stream_reads_json_checkpoint_batch_without_sidecars
- Verifies that JSON checkpoint batches are read correctly
test_create_checkpoint_stream_reads_checkpoint_batch_with_sidecar
- Ensures that checkpoint files containing sidecar references also return the corresponding sidecar batches.
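
The schema rule exercised by the first two create_checkpoint_stream tests can be sketched as a small check. This is a hedged illustration, assuming the schema is represented by its top-level column names; `validate_read_schema` and its error message are hypothetical, not the kernel's code.

```rust
/// Reject a checkpoint read schema that asks for file actions without the sidecar column.
fn validate_read_schema(columns: &[&str]) -> Result<(), String> {
    let has = |name: &str| columns.iter().any(|c| *c == name);
    if (has("add") || has("remove")) && !has("sidecar") {
        return Err("schema with file actions must also include the sidecar column".to_string());
    }
    Ok(())
}
```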

codecov · 2025-02-06T00:37:04Z

Codecov Report

Attention: Patch coverage is 85.73944% with 81 lines in your changes missing coverage. Please review.

Project coverage is 84.42%. Comparing base (ca18e7f) to head (fbe1d87).

Files with missing lines	Patch %	Lines
kernel/src/log_segment/tests.rs	85.21%	5 Missing and 59 partials ⚠️
kernel/src/log_segment.rs	82.02%	4 Missing and 12 partials ⚠️
kernel/src/scan/mod.rs	96.77%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #679      +/-   ##
==========================================
+ Coverage   84.36%   84.42%   +0.06%     
==========================================
  Files          75       75              
  Lines       17654    18202     +548     
  Branches    17654    18202     +548     
==========================================
+ Hits        14893    15367     +474     
- Misses       2052     2055       +3     
- Partials      709      780      +71


OussamaSaoudi · 2025-02-07T04:00:39Z

kernel/src/log_segment.rs

+        let sidecar_files: Result<Vec<_>, _> = visitor
+            .sidecars
+            .iter()
+            .map(|sidecar| Self::sidecar_to_filemeta(sidecar, &log_root))


I wonder if sidecar_to_filemeta could be a closure. We only use this once.

    let sidecar_to_filemeta = |sidecar| {
        let location = log_root.join("_sidecars/")?.join(&sidecar.path)?;
        Ok(FileMeta {
            location,
            last_modified: sidecar.modification_time,
            size: sidecar.size_in_bytes as usize,
        })
    };

And then map over the sidecars:

    visitor.sidecars.iter().map(sidecar_to_filemeta)

Give it a shot and see how it is.

Do you think it's a good idea to leave it as a separate function for unit testing purposes?

I'd say either keep the separate function (if needed for testing) or embed the logic directly in the map call? What purpose does a separately named closure serve?

(aside: not sure if cargo fmt will like my indentation choice above -- depends on whether the ( or { is more important)

OussamaSaoudi · 2025-02-07T04:04:53Z

kernel/src/log_segment.rs

+    }
+
+    fn process_single_checkpoint_batch(
+        parquet_handler: Arc<dyn ParquetHandler>,


iirc, we want to avoid passing handlers around and only pass a reference to the engine. I think it's because we want to make it clear that the handler is tied to the engine, and we don't want to encourage holding an Arc ref to the handler.

cc @zachschuermann to double check.

I recall running into lifetime issues when passing the entire engine. I believe we would have to explicitly tie the iterator's lifetime to that of the engine?

You can also pass an Arc.

Basically the iterator needs to hold a reference for the entire duration it's lazily evaluating. So you want to give it a reference it can hold for a long time.

But this change extends all the way to changing the scan_data function signature to explicitly tie the engine's lifetime to the iterator.

    pub fn scan_data<'a>(
        &self,
        engine: &'a dyn Engine,
    ) -> DeltaResult<impl Iterator<Item = DeltaResult<ScanData>> + 'a> {

Mmm, so the basic issue is that we have delayed reading of parquet files, so at some point we want an item off the iterator, and to produce it, we need to read some parquet, so we need a handler. Previously we could do all the read calls up front and then just map off that iterator, so we didn't need an engine ref plumbed through.

I think if this is all internal, i.e., we don't want to expose any of these function signatures to engines (especially in the FFI), then cloning the Arcs is fine (it's very cheap; as a suggestion, we usually put `// cheap arc clone` at those clone sites to make it clear).

If we do want to ever expose this, we'll need to think more, but afaict, we don't.

Thanks for the look @nicklan. Just to confirm, we will go ahead and clone the parquet handler.
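
The trade-off discussed in this thread can be sketched in isolation. This is a std-only model under stated assumptions: the `ParquetHandler` trait here, `FakeHandler`, `Engine`, and `lazy_batches` are all simplified, hypothetical stand-ins, not the kernel's types. The point is that an iterator owning a cloned `Arc` can read files lazily without borrowing from the engine.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the engine's Parquet handler.
trait ParquetHandler {
    fn read_file(&self, path: &str) -> Vec<String>;
}

struct FakeHandler;

impl ParquetHandler for FakeHandler {
    fn read_file(&self, path: &str) -> Vec<String> {
        // Pretend every file contains two batches.
        vec![format!("{}#0", path), format!("{}#1", path)]
    }
}

/// Hypothetical stand-in for an engine that owns its handlers.
struct Engine {
    parquet: Arc<dyn ParquetHandler>,
}

impl Engine {
    fn parquet_handler(&self) -> Arc<dyn ParquetHandler> {
        self.parquet.clone() // cheap arc clone
    }
}

/// The returned iterator owns its own Arc, so it carries no lifetime tied to
/// `engine` and can defer each read until an item is requested.
fn lazy_batches(engine: &Engine, paths: Vec<String>) -> Box<dyn Iterator<Item = String>> {
    let handler = engine.parquet_handler();
    Box::new(paths.into_iter().flat_map(move |p| handler.read_file(&p)))
}
```

Passing `&'a dyn Engine` instead would force the `'a` bound onto every signature that returns the iterator, which is the ripple effect described above.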

scovich · 2025-02-08T12:38:34Z

kernel/src/actions/visitors.rs

+                // We read checkpoint batches with the sidecar action. This results in empty paths
+                // if a row is not a sidecar action. We do not want to create a sidecar action for
+                // these rows.
+                if path.is_empty() {


This looks wrong. Are we mishandling column nullability somewhere, that can cause empty strings to be returned instead of NULL?

It seems like this sort of issue has shown up a few times recently -- do we have a lurking bug somewhere? or is null handling just error-prone in general?

I encountered Sidecar actions with empty strings for path when running test_create_checkpoint_stream_reads_parquet_checkpoint_batch at first. I believe it is because of the way I am creating the dummy checkpoint batch (beginning with a json string, specifically add_batch_simple()). The fix is not obvious to me though...

For more context: when I create a dummy engine data batch from a JSON string (without sidecar actions) with the SyncEngine's JSON handler (pub(crate) fn parse_json, at delta-kernel-rs/kernel/src/engine/arrow_utils.rs, line 614 in eedfd47) and the sidecar action is included in the output_schema, I hit the error case above. When the sidecar action is not included in the output_schema passed to the test util, the sidecar column handles nullability correctly. Does this seem like an issue with the core functionality of the SyncEngine's JSON handler, @scovich? This isn't an area I'm deeply familiar with, but I'll look into it; my investigation might take a bit longer.

Below is the json string which is converted to engine data with string_array_to_engine_data, which is finally passed to the sync engine.

r#"{"metaData":{"id":"testId","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"value\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}}]}","partitionColumns":[],"configuration":{"delta.enableDeletionVectors":"true","delta.columnMapping.mode":"none"},"createdTime":1677811175819}}"#,

This code has disappeared in the latest version, did you figure out a fix?

I found a fix for the test case which created batches that had this weird empty-string Sidecar path field.

However, I am concerned that there may be something wrong with the SyncEngine's json handler's functionality as it allowed me to create this malformed batch

Sorry, how was the batch malformed (= physically invalid)?

It seemed like the test was simply passing an empty string, which is schema-compatible. Arrow has no way to know that Delta puts additional constraints on the field value?

Sorry for the confusion, the test case this issue originated from was: test_create_checkpoint_stream_reads_parquet_checkpoint_batch_without_sidecars I've left a comment below with more context

#692 solved this, right?
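
The distinction at stake in this thread, NULL versus empty string, can be shown with a toy visitor. This is only a sketch: the nullable `sidecar.path` column is modeled as `Option<&str>` per row, and `Sidecar` and `visit_sidecar_paths` are hypothetical names rather than the kernel's visitor API.

```rust
#[derive(Debug, PartialEq)]
struct Sidecar {
    path: String,
}

/// Each row of the nullable path column is `None` when the row is not a sidecar action.
fn visit_sidecar_paths(rows: &[Option<&str>]) -> Vec<Sidecar> {
    rows.iter()
        // Null rows are skipped by nullability alone; no empty-string check is needed.
        .filter_map(|row| row.map(|p| Sidecar { path: p.to_string() }))
        .collect()
}
```

With correct null handling upstream, an `if path.is_empty()` guard is unnecessary; an empty string only shows up if the producer of the batch wrote one.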

scovich · 2025-02-08T13:22:15Z

kernel/src/log_segment.rs

+        // If sidecars files exist, read the sidecar files and return the iterator of sidecar batches
+        // to replace the checkpoint batch in the top level iterator
+        Ok(Right(parquet_handler.read_parquet_files(


This is subtle -- replacing the top-level means all non-file actions will be lost. This is only ~~safe~~ correct if all checkpoint scans are exclusively requesting adds or exclusively requesting non-file actions. I'm pretty sure it will break our inspect-table example that visits all actions during log replay.

We would either need to keep returning the top-level actions unconditionally (safer) or inspect the read schema to see whether we need non-file actions. Simpler feels (a lot) safer to me, and seems unlikely to cause any measurable performance hit: each checkpoint part has thousands of actions, vs. dozens in the top-level manifest.

Based on the above, we might be able to eliminate multiple left/right use sites, by careful management of type signatures. Conceptually, we would always do a map over the top level checkpoint iterator, producing the following output:

    let sidecar_content = Self::process_sidecars(top_level_batch, ...); // returns Option<impl Iterator>
    std::iter::once(top_level_batch).chain(sidecar_content.into_iter().flatten())

We could pass a flag into process_sidecars that short circuits it to None (pretending no sidecars were found), or we could just cheat and do a spurious map call, just to get the correct signature:

std::iter::once(top_level_batch).chain(None.into_iter().flatten())

Ugh... compiler didn't like my toy example that did the spurious map call...

note: no two closures, even if identical, have the same type

Thanks for the catch @scovich. I was operating under the assumption that "all checkpoint scans are exclusively requesting adds or exclusively requesting non-file actions" would always be true. But keeping the top-level actions unconditionally feels like a much safer approach. I'll move forward with this and have noted this decision in the design doc.
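
The `once(..).chain(option.into_iter().flatten())` shape suggested in this thread can be tried out in a self-contained form. This is a sketch under assumptions: batches are modeled as strings, and `process_sidecars` and `expand` are hypothetical simplifications that take a `skip` flag to short-circuit the search.

```rust
/// Hypothetical sidecar lookup: returns `None` when the batch references no
/// sidecars, or when `skip` short-circuits the search.
fn process_sidecars(batch: &str, skip: bool) -> Option<std::vec::IntoIter<String>> {
    if skip || !batch.contains("sidecar") {
        return None;
    }
    Some(vec![format!("{}/part-0", batch), format!("{}/part-1", batch)].into_iter())
}

/// Always yields the top-level batch first, then any sidecar content.
fn expand(batch: &str, skip: bool) -> impl Iterator<Item = String> {
    let sidecar_content = process_sidecars(batch, skip);
    // `Option<I>` flattens to zero items for `None`, so both cases share one
    // iterator type and no Left/Right wrapper is needed.
    std::iter::once(batch.to_string()).chain(sidecar_content.into_iter().flatten())
}
```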

scovich

Shape of the PR looks good. A few questions and nits (and waiting for it to exit "draft" status)

scovich · 2025-02-10T18:06:52Z

kernel/src/actions/visitors.rs

+                // We read checkpoint batches with the sidecar action. This results in empty paths
+                // if a row is not a sidecar action. We do not want to create a sidecar action for
+                // these rows.
+                if path.is_empty() {


This code has disappeared in the latest version, did you figure out a fix?

scovich · 2025-02-10T18:25:40Z

kernel/src/log_segment.rs

+                        skip_sidecar_search,
+                    )?;
+
+                    Ok(std::iter::once(Ok((checkpoint_batch, false))).chain(


If you really wanted to get fancy, could do:

    let top_iterable = need_nonfile_actions.then(|| Ok((checkpoint_batch, false)));
    Ok(sidecar_content...chain(top_iterable))

... where need_nonfile_actions comes from a schema test, similar but opposite to the skip_sidecar_search. But it's probably not worth the complexity.

I agree. I'd like to move forward with unconditionally including the checkpoint batch for simplicity's sake.
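
For reference, the "fancy" variant that was not adopted can be sketched as follows. It relies on `bool::then` producing an `Option`, which `chain` accepts directly; batches are modeled as strings and `batches_to_replay` is a hypothetical name.

```rust
/// Emit the sidecar batches, then the top-level batch only if non-file actions are needed.
fn batches_to_replay(
    top_level_batch: &str,
    sidecar_batches: Vec<String>,
    need_nonfile_actions: bool,
) -> Vec<String> {
    // `then` yields Some(batch) when the flag is true and None otherwise.
    let top_iterable = need_nonfile_actions.then(|| top_level_batch.to_string());
    sidecar_batches.into_iter().chain(top_iterable).collect()
}
```

Always including the top-level batch, as decided above, removes the need for the schema test that would compute `need_nonfile_actions`.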

scovich · 2025-02-10T18:31:10Z

kernel/src/log_segment.rs

+        let sidecar_files: Result<Vec<_>, _> = visitor
+            .sidecars
+            .iter()
+            .map(|sidecar| Self::sidecar_to_filemeta(sidecar, &log_root))


I'd say either keep the separate function (if needed for testing) or embed the logic directly in the map call? What purpose does a separately named closure serve?

(aside: not sure if cargo fmt will like my indentation choice above -- depends on whether the ( or { is more important)

sebastiantia · 2025-02-10T19:45:16Z

kernel/src/log_segment/tests.rs

+
+    mock_table
+        .parquet_checkpoint(
+            add_batch_simple(get_log_add_schema().clone()),


@scovich this is the batch creation I mentioned in the other comment.

Previously, when passing a schema that included the sidecar action to add_batch_simple, I found that the SidecarVisitor would find sidecar actions with an empty path field.

Fixed by #692, right?

scovich · 2025-02-10T20:06:34Z

(nit: changed feat: to feat!: in the PR title to reflect breaking change status)

nicklan

lgtm!

sebastiantia requested a review from OussamaSaoudi February 6, 2025 00:33

github-actions bot assigned sebastiantia Feb 6, 2025

github-actions bot added the breaking-change Change that will require a version bump label Feb 6, 2025

sebastiantia force-pushed the read-v2-checkpoints branch from b2c5001 to 00af1f9 Compare February 6, 2025 21:32

OussamaSaoudi reviewed Feb 7, 2025

sebastiantia closed this Feb 7, 2025

sebastiantia deleted the read-v2-checkpoints branch February 7, 2025 20:27

sebastiantia restored the read-v2-checkpoints branch February 7, 2025 20:27

sebastiantia reopened this Feb 7, 2025

sebastiantia closed this Feb 7, 2025

sebastiantia deleted the read-v2-checkpoints branch February 7, 2025 20:31

sebastiantia restored the read-v2-checkpoints branch February 7, 2025 21:11

sebastiantia reopened this Feb 7, 2025

sebastiantia mentioned this pull request Feb 7, 2025

feat: support the v2Checkpoint reader/writer feature #685

Open

nicklan reviewed Feb 8, 2025

scovich reviewed Feb 8, 2025

sebastiantia changed the title ~~Read v2 checkpoints~~ feat: insert sidecar batches in replay's action iterator when necessary Feb 10, 2025

sebastiantia changed the title ~~feat: insert sidecar batches in replay's action iterator when necessary~~ feat: extract & insert sidecar batches in replay's action iterator Feb 10, 2025

scovich reviewed Feb 10, 2025

sebastiantia removed the breaking-change Change that will require a version bump label Feb 10, 2025

github-actions bot added the breaking-change Change that will require a version bump label Feb 10, 2025

sebastiantia force-pushed the read-v2-checkpoints branch from 541655e to 4631bae Compare February 10, 2025 19:37

sebastiantia commented Feb 10, 2025

sebastiantia marked this pull request as ready for review February 10, 2025 19:48

sebastiantia requested review from scovich, nicklan and OussamaSaoudi February 10, 2025 20:00

scovich changed the title ~~feat: extract & insert sidecar batches in replay's action iterator~~ feat!: extract & insert sidecar batches in replay's action iterator Feb 10, 2025

sebastiantia added 23 commits February 21, 2025 11:29

remove redundant type conversions

91a187e

refactor

0b57452

remove redundant .into_iter

433a4bb

handle errors from windows os

2122f59

remove unnecessary empty path check

98cba07

typo

51a34a8

nits

fe22868

infer type

1e8ed59

review & nits

f6370ef

remove test iterator

1133914

review

b60ba43

clippy

eb0f1bb

link issue

df874bb

nits

5ea49d7

nits

306e4ea

test review

9a67e06

nits

59936cf

remove debug statements

fc180a4

review

ce422f7

comments & review

21893d6

typo

6e64916

typo

626f7b4

review

ff67fcf

sebastiantia force-pushed the read-v2-checkpoints branch from 7ec299b to ff67fcf Compare February 21, 2025 19:31

github-actions bot added the breaking-change Change that will require a version bump label Feb 21, 2025

fix arrow imports

fbe1d87

sebastiantia requested review from zachschuermann and OussamaSaoudi February 21, 2025 22:00

sebastiantia removed the breaking-change Change that will require a version bump label Feb 21, 2025

nicklan approved these changes Feb 22, 2025
